This project aimed to investigate the differences in physical activity patterns between depressed and non-depressed individuals using data collected from actigraphs. The study utilized a depression dataset containing sensor data collected from patients, with the goal of extracting relevant features and utilizing machine learning algorithms to classify and differentiate between depressed and non-depressed groups.
The data was pre-processed by checking for missing values in each file, followed by the extraction of various features such as the average hourly activity, average weekday activity, and average activity within 6-hour intervals. These 55 per-patient feature sets were then combined into a single dataset. Exploratory Data Analysis (EDA) involved the utilization of different types of plots to identify patterns associated with depressed and non-depressed individuals. The analysis revealed that the activity data extracted from the sensor recordings showed promising potential as a feature for diagnosing depression in patients. In addition, the project incorporated the use of the tsfresh library to extract additional features from the data. These extracted features, along with the initially obtained ones, were then utilized as inputs for the machine learning model. By employing this approach, the project aimed to classify and differentiate between depressed and non-depressed groups. The results obtained through machine learning algorithms demonstrated the feasibility of performing classification based on the available features.
However, it is important to note that the limited number of observations in the dataset (55) may have impacted the overall reliability and generalizability of the results. The small number of observations affects the data splitting ratio for training and testing, which can influence the classification metrics. A larger dataset would improve the model’s performance and enhance the robustness and accuracy of the results.
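As an illustration of how the split ratio interacts with such a small sample, here is a minimal sketch (not part of the original analysis; the class counts mirror the 23 condition / 32 control split) of a stratified train/test split:

```python
import numpy as np

# Hypothetical labels mirroring the dataset: 23 depressed (1), 32 controls (0).
y = np.array([1] * 23 + [0] * 32)

def stratified_split(y, test_frac=0.2, seed=0):
    """Split indices so that each class keeps roughly the same proportion."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        rng.shuffle(idx)
        n_test = round(test_frac * len(idx))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

train_idx, test_idx = stratified_split(y)
# With 55 samples, a 20% test set holds only 11 observations, so metrics
# from a single split are noisy; cross-validation is preferable here.
print(len(train_idx), len(test_idx))  # 44 11
```

Stratification at least preserves the 23/32 class balance in both partitions, but it cannot compensate for the small absolute test-set size.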
Furthermore, depression is a complex mental health condition influenced by various factors, and activity level alone might not capture the complete picture. Incorporating additional information, such as psychological questionnaire data, could enhance the accuracy and reliability of the classification results.
2 Introduction
Undeniably, the Internet of Medical Things (IoMT) is now used to monitor individual health. Wearable IoMT devices, such as the widely used smartwatches, can track users' activities in dynamic surroundings and can deliver interventions for mental health disorders, for example by defining an activity goal and encouraging users to reach it through notifications, or by reminding them to meditate. Wearable activity trackers collect fine-grained sensor data characterizing the behavior and physiology of users, providing objective measures of physical activity. Digital biomarkers derived from these sensor recordings can aid in the improved monitoring and detection of episodes associated with mental health disorders (Arora et al. (2023)).
The use of on-body sensors to monitor personal health has become quite normal these days. This data holds a lot of potential beyond counting daily steps or calories burned, since continuous recordings of heart rate and activity levels are usually collected. There is an increasing awareness in the field of psychiatry of how these activity data relate to various mental health issues such as changes in mood or personality, inability to cope with daily problems or stress, and withdrawal from friends. Depression is one of the most frequent mental disorders, and the current trend indicates that its prevalence will increase even further in the coming years. Dealing with depression can be demanding, since it can create physical, economic and emotional problems, often leading to problems with work and sick leave (Garcia-Ceja et al. (2018)).
The purpose of this project is to investigate the differences in physical activity patterns between depressed and non-depressed individuals using data collected from actigraphs. The study is done on the depression dataset, which contains sensor data collected from patients. In this project, different features are extracted from these data, such as the activity patterns on different days of the week and at different hours of the day. Then, these features are compared between the condition and control groups to understand the differences between them. Ultimately, we use these features as the input of machine learning classification algorithms to see whether they can differentiate between depressed and non-depressed groups.
3 Data Collection
Actigraphy is a quantitative method both for measuring gross motor activity and for examining diurnal variations in motor activity. Motor activity refers to any voluntary or involuntary movement of the body generated by the contraction of skeletal muscles, and includes a wide range of activities such as walking, running, jumping, writing, typing, and even simple movements such as blinking and breathing. For collecting the data, motor activity was monitored with an actigraph worn at the right wrist of the chosen participants (Actiwatch, Cambridge Neurotechnology Ltd, England, model AW4). The sampling frequency is 32 Hz, and movements over 0.05 g are recorded. "g" refers to the acceleration due to gravity, so a movement of 0.05 g is a small acceleration: only relatively strong movements or vibrations are recorded, while smaller or less significant ones are ignored. A corresponding voltage is produced and stored as an activity count in the memory unit of the actigraph watch. The number of counts is proportional to the intensity of the movement. Total activity counts were recorded in one-minute intervals for a continuous period of two weeks. Both patients and controls were instructed to wear their actigraphs at all times except when taking a shower (Berle et al. (2010)).
The dataset contains motor activity recordings of 23 unipolar and bipolar depressed patients and 32 healthy controls. Five subjects were hospitalized during their data collection period, and 18 were outpatients.
3.1 Dataset Structure
The dataset contains 2 folders of condition and control and a scores.csv file. For each patient, a csv file containing the actigraph data collected over time is provided.
3.1.1 Condition and Control Folder
In these folders, we have 55 csv files in total. Each csv file has three columns of:
timestamp (one minute intervals),
date (date of measurement),
activity (activity measurement from the actigraph watch).
Each dataset has a different size depending on the duration of the study for each individual. Here is an example of the dataset.
Code
# Reading the sensor datasets
rawdfs = {}
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = pd.read_csv(file_path)
            rawdfs[name] = df
            df['timestamp'] = pd.to_datetime(df['timestamp'])  # new added
            df['month'] = df['timestamp'].dt.month  # new added
rawdfs['control_22'].head()
   timestamp            date        activity  month
0  2004-02-17 09:00:00  2004-02-17  7         2
1  2004-02-17 09:01:00  2004-02-17  9         2
2  2004-02-17 09:02:00  2004-02-17  9         2
3  2004-02-17 09:03:00  2004-02-17  7         2
4  2004-02-17 09:04:00  2004-02-17  9         2
3.1.2 Score Dataset
The score dataset has the shape (55, 12): 55 rows, one for each of the 32 control and 23 condition individuals. The scores.csv file contains the following columns:
number (patient identifier),
days (number of days of measurements),
gender (1 or 2 for female or male),
age (age in age groups),
afftype (1: bipolar II, 2: unipolar depressive, 3: bipolar I): in the dataset, 15 individuals in the condition group had unipolar depression, 7 had bipolar I, and 1 had bipolar II.
melanch (1: melancholia, 2: no melancholia): in the dataset, 19 individuals in the condition group had no melancholia, one had melancholia, and 3 were undefined.
inpatient (1: inpatient, 2: outpatient): 5 individuals in the condition group were hospitalized during the study.
edu (education grouped in years),
marriage (1: married or cohabiting, 2: single),
work (1: working or studying, 2: unemployed/sick leave/pension),
madrs1 (MADRS score when measurement started),
madrs2 (MADRS when measurement stopped).
MADRS (Montgomery-Asberg Depression Rating Scale) measures the severity of the patients' depressive state. The MADRS scoring instructions indicate that a total score of 0 to 6 places the patient in the normal range (no depression), a score of 7 to 19 indicates "mild depression", 20 to 34 indicates "moderate depression", a score of 35 and greater indicates "severe depression", and a total score of 60 or greater indicates "very severe depression".
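The cutoffs above can be captured in a small helper function (a sketch for illustration; not part of the original notebook):

```python
def madrs_category(score):
    """Map a MADRS total score to the severity bands quoted above."""
    if score <= 6:
        return "normal"        # 0-6: no depression
    elif score <= 19:
        return "mild"          # 7-19
    elif score <= 34:
        return "moderate"      # 20-34
    elif score < 60:
        return "severe"        # 35-59
    else:
        return "very severe"   # 60 and above

print([madrs_category(s) for s in (4, 12, 25, 40, 60)])
# ['normal', 'mild', 'moderate', 'severe', 'very severe']
```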
Code
print('count of mild depression regarding MADRS score:\n',
      len(scoredf.loc[(scoredf['madrs1'] > 7) & (scoredf['madrs1'] < 20)]))
count of mild depression regarding MADRS score:
7
Code
print('count of moderate depression regarding MADRS score:\n',
      len(scoredf.loc[scoredf['madrs1'] > 19]))
count of moderate depression regarding MADRS score:
16
4 Functions
To make our tasks easier, we define several functions. A function is a piece of code written to carry out a specified task. We use functions to bundle sets of instructions that we want to reuse or that are complex. A short description of each function follows:
| Function | Description | Input |
| --- | --- | --- |
| read_raw | Reads each csv file of timeseries, drops the date variable, converts the timestamp variable to datetime and sets it as the index of the dataframe. | File path |
| read_filled | Same process as read_raw, but also checks for time gaps between the indexes and fills them with the average of the activities 1 hour before and 1 hour after the gap. | File path |
| smoothed_series | Smoothes the timeseries data over a given window and creates a new dataframe. | Dataframe, window size |
| reduced_series | Reduces the size of the dataframe by averaging the activity over a defined time interval. | Dataframe, interval |
| fft | Takes the timeseries and performs the Fourier transformation on it. | Dataframe |
| compare_plot | Creates six plots for each dataset: 1. count of activity in each minute (raw data); 2. count of activity in a smoothed series, where the raw data is replaced by the mean over a selected window; 3. FFT of the data; 4. histogram of the raw data; 5. average activity per weekday; 6. count of activity per selected interval. | Dataframe, window size, interval, title |
Code
def read_raw(file_path):
    '''
    We define this function to read each csv file, change the timestamp
    data type into datetime and set it as the index of the dataframe.
    '''
    series = pd.read_csv(file_path).drop('date', axis=1)
    series['timestamp'] = pd.to_datetime(series['timestamp'])
    series = series.set_index('timestamp')
    return series
Code
def smoothed_series(series, window):
    '''
    We define this function to smooth the timeseries data over
    a given window and create a new dataframe.
    '''
    rolling_avg = pd.DataFrame(series['activity'].rolling(window).mean())
    return rolling_avg
Code
def reduced_series(series, interval):
    '''
    We define this function to reduce the size of the dataframe by
    averaging the activity over the defined time interval.
    '''
    hourlist = ['H', '2H', '3H', '4H', '6H', '8H', '12H', '24H']
    minutelist = ['1T', '15T', '30T']
    daylist = ['D']
    if interval in hourlist:
        series = series.groupby(pd.Grouper(freq=interval)).mean().reset_index()
        series['time'] = series['timestamp'].dt.hour
    elif interval in minutelist:
        series = series.groupby(pd.Grouper(freq=interval)).mean().reset_index()
    else:
        series = series.groupby(pd.Grouper(freq=interval)).mean().reset_index()
        series['weekday'] = series['timestamp'].dt.day_name()
    return series
Code
def read_filled(file_path):
    '''
    We define this function to read each csv file, check if there is any
    time gap between the indexes and fill it by taking the average of
    the activities of 1 hour before and 1 hour after the time gap.
    '''
    series = pd.read_csv(file_path)
    series.drop('date', axis=1, inplace=True)
    series = series.set_index('timestamp')
    series.index = pd.to_datetime(series.index)
    gap_threshold = pd.Timedelta(minutes=1)
    new_rows = []
    new_index = pd.DatetimeIndex([])  # define new_index outside the loop
    for i in range(1, len(series)):
        time_diff = series.index[i] - series.index[i-1]
        if time_diff > gap_threshold:
            start1 = series.index[i-1] - pd.Timedelta(minutes=60)
            end1 = series.index[i-1]
            start2 = series.index[i]
            end2 = series.index[i] + pd.Timedelta(minutes=60)
            before_activity = series.loc[start1:end1]['activity'].mean()
            after_activity = series.loc[start2:end2]['activity'].mean()
            value = (before_activity + after_activity) / 2
            new_index = pd.date_range(start=series.index[i-1] + pd.Timedelta(minutes=1),
                                      end=series.index[i] - pd.Timedelta(minutes=1),
                                      freq='min')
            for j in range(len(new_index)):
                new_row = {'activity': value}
                new_rows.append(new_row)
            new_series = pd.DataFrame(new_rows, index=new_index)
            new_series.index.name = 'timestamp'
            series = pd.concat([series, new_series], axis=0)
            series = series.sort_index()
    return series
Code
def fft(df):
    '''
    We define this function for the fast Fourier transformation: we
    transform the activity column data and create a frequency column.
    '''
    df = df.reset_index()
    # df['fft'] = np.fft.fft(df['activity'].to_numpy())
    # df['freq'] = np.fft.fftfreq(len(df), (df['timestamp'][1] - df['timestamp'][0]).total_seconds())
    df['fft'] = np.abs(np.fft.fft(df['activity'].to_numpy()))
    df['freq'] = np.fft.fftfreq(df.shape[0])
    df = df.set_index(keys='timestamp')
    return df
Code
def compare_plot(df, window, interval, title):
    '''
    We define this function to create six plots for each dataset.
    '''
    fig, axs = plt.subplots(1, 6, figsize=(20, 4))
    # Raw data
    axs[0].plot(df['activity'])
    axs[0].set_title('Time Series Plot')
    # Smoothed data
    axs[1].plot(smoothed_series(df, window)['activity'], "b-")
    axs[1].set_title(f'{window} Minutes Smoothed Time Series')
    # FFT-transformed data
    axs[2].plot(df['freq'], df['fft'])
    axs[2].set_title('FFT')
    # Raw data histogram
    axs[3].hist(np.array(df['activity']), density=True, bins=20,
                edgecolor='black', facecolor='green', alpha=0.75)
    axs[3].set_title('Histogram')
    # Data per weekday
    week = reduced_series(df, 'D')
    weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                     'Friday', 'Saturday', 'Sunday']
    week['weekday'] = pd.Categorical(week['weekday'],
                                     categories=weekday_order, ordered=True)
    activity_count = week.groupby('weekday')['activity'].mean()
    axs[4].plot(activity_count.index, activity_count.values, "r-")
    axs[4].set_title('Weekday Plot')
    # Reduced data into intervals
    activity_count = pd.DataFrame(
        reduced_series(df, interval).groupby('time')['activity'].mean())
    axs[5].plot(activity_count, "k-")
    axs[5].set_title(f'Time Series with {interval} Interval')
    for ax in axs:
        ax.set_xticks(ax.get_xticks()[::])
        ax.tick_params(axis='x', labelrotation=90)
    fig.suptitle(title, fontsize=12, y=1.02)
    plt.show()
5 Data Pre-Processing
5.1 Checking for missing values
5.1.1 Timeseries
After reading each dataset with the read_raw() function and checking for missing values with the isna() function, we found that the raw data have no missing values for the activity variable. We then reduced the size of each dataset with the reduced_series() function, so that instead of one value per minute we have the average activity per hour. After this reduction, 5 datasets contained missing values. According to the data collection report, the data were collected at 1-minute intervals. By computing the time differences between consecutive rows of the raw datasets, we found that there are time gaps in the datasets, which create missing values after applying the reduced_series() function.
Code
# In this chunk of code, we read all the files in the condition and
# control folders and check the missing values.
dfs = {}
missing = []
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = read_raw(file_path)
            df.insert(0, 'number', name)
            dfs[name] = df
            df_missing = df['activity'].isna().sum()
            missing.append(df_missing)
print('Number of missing values in raw datasets: ', sum(missing))
Number of missing values in raw datasets: 0
Code
# We reduce the size of each dataset by taking the average of activities
# per hour and then check the missing values.
reduced_dfs = {}
reduced_missing = []
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = read_raw(file_path)
            df = reduced_series(df, interval='H')
            df.insert(0, 'number', name)
            df['time_diff'] = df.index.to_series().diff()
            reduced_dfs[name] = df
            df_missing = df['activity'].isna().sum()
            reduced_missing.append(df_missing)
print('Number of missing values after reducing the size of raw datasets into hour intervals: ', sum(reduced_missing))
Number of missing values after reducing the size of raw datasets into hour intervals: 5
Here is an example of a raw dataset compared to its hourly reduced version:
Code
dfs['condition_13'].head()
                     number        activity
timestamp
2005-08-11 09:00:00  condition_13  0
2005-08-11 09:01:00  condition_13  0
2005-08-11 09:02:00  condition_13  0
2005-08-11 09:03:00  condition_13  0
2005-08-11 09:04:00  condition_13  0
Code
reduced_dfs['condition_13'].head()
   number        timestamp            activity    time  time_diff
0  condition_13  2005-08-11 09:00:00  262.133333  9     NaN
1  condition_13  2005-08-11 10:00:00  487.300000  10    1.0
2  condition_13  2005-08-11 11:00:00  742.616667  11    1.0
3  condition_13  2005-08-11 12:00:00  458.683333  12    1.0
4  condition_13  2005-08-11 13:00:00  343.083333  13    1.0
The time gaps in all of these datasets occurred on the same date and at the same hour (30 March 2003, 2 a.m. to 3 a.m.). The affected datasets are identified below. Only these individuals' data cover the date 2003-03-30; the gap corresponds to the daylight saving time change in 2003, which took place on 30 March at 2 a.m.
We solved this problem with the read_filled() function: after reading each csv file, it checks for time gaps in the dataset, adds new rows covering each gap, and fills them with the average of the activities of one hour before and one hour after the gap.
After reading all the csv files with the read_filled() function, the new result showed no missing values. We stored the datasets in a dictionary named filled_dfs, where the keys are the dataset names and the values are the data.
Code
# Here we detect which datasets have time gaps
unique = {}
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = read_raw(file_path)
            df.insert(0, 'number', name)
            df['time_diff'] = df.index.to_series().diff()
            unique[name] = len(df['time_diff'].unique())
gap = [key for key, value in unique.items() if value > 2]
print('Datasets which contain timegap: ', gap)
Datasets which contain timegap: ['control_1', 'control_31', 'control_32', 'control_6', 'control_7']
Code
# Here we detect which datasets contain data collected on 2003-03-30
datasets_with_date = []
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = pd.read_csv(file_path)
            df['timestamp'] = pd.to_datetime(df['timestamp'])
            if '2003-03-30' in df['timestamp'].dt.date.values.astype(str):
                datasets_with_date.append(name)
print('Datasets that contain the date 2003-03-30:', datasets_with_date)
Datasets that contain the date 2003-03-30: ['control_1', 'control_31', 'control_32', 'control_6', 'control_7']
Code
# We check the missing values of the reduced datasets again
# after using the read_filled() function
dfs = {}
reduced_missing = []
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = read_filled(file_path)
            df = reduced_series(df, interval='6H')
            df.insert(0, 'number', name)
            dfs[name] = df
            df_missing = df['activity'].isna().sum()
            reduced_missing.append(df_missing)
print('Number of missing values after filling out the timegap and reducing the size of datasets: ', sum(reduced_missing))
Number of missing values after filling out the timegap and reducing the size of datasets: 0
Code
# We read all of our datasets using the read_filled() function
# and store them in a dictionary.
filled_dfs = {}
directory_path = 'data'
directory_list = os.listdir(directory_path)
for subdir in directory_list:
    subdir_path = os.path.join(directory_path, subdir)
    if os.path.isdir(subdir_path):
        for file in os.listdir(subdir_path):
            file_path = os.path.join(subdir_path, file)
            name = os.path.splitext(os.path.basename(file))[0]
            df = read_filled(file_path)
            df.insert(0, 'number', name)
            df.reset_index()
            filled_dfs[name] = df
5.1.2 Score Dataset
After counting the missing values in the score file, we found that most columns are filled in only for the condition group; for the control group we only have number (the id of each individual, matching the csv file names), days, gender and age. We therefore removed the other columns, which contain missing values.
Code
scoredf.isnull().sum()
number 0
days 0
gender 0
age 0
afftype 32
melanch 35
inpatient 32
edu 2
marriage 32
work 32
madrs1 32
madrs2 32
dtype: int64
Code
# Dropping columns with missing values
scoredf = scoredf.dropna(axis=1)
scoredf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 number 55 non-null object
1 days 55 non-null int64
2 gender 55 non-null int64
3 age 55 non-null object
dtypes: int64(2), object(2)
memory usage: 1.8+ KB
We have a days column in the score dataset which contains the number of days that an individual was under study. On the other hand, by having the csv files for each individual, we can compute the duration of the study for each individual in two ways:
First, the difference of start date and end date.
Second, the length of the dataset for each case.
By comparing the results of the two methods above, no difference was observed. This also confirms that we no longer have time gaps in our datasets. However, comparing these results with the days column in the score dataset showed remarkable differences, which suggests that the days data is not correct. Since this column has no significant effect on our study and does not provide meaningful information, we remove it from the score dataset.
Code
# Duration of data collection for each case
durationlist1 = {}
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[list(filled_dfs.keys())[i]]
    start = df.reset_index().iloc[0, 0]
    end = df.reset_index().iloc[-1, 0]
    duration = pd.Timedelta(end - start)
    duration = round(duration.total_seconds() / (24 * 60 * 60), 1)  # convert to float days
    durationlist1[name] = duration
durationlist1 = pd.DataFrame(durationlist1.items(), columns=['number', 'method_1_duration'])
print("First method's results:\n", durationlist1.head())
First method's results:
number method_1_duration
0 condition_1 16.1
1 condition_10 15.0
2 condition_11 16.0
3 condition_12 15.4
4 condition_13 18.0
Code
# Extracting the duration based on the length of each dataset
durationlist2 = {}
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[list(filled_dfs.keys())[i]]
    df_len = df.shape[0]
    durationlist2[name] = round(df_len / (24 * 60), 1)
durationlist2 = pd.DataFrame(durationlist2.items(), columns=['number', 'method_2_duration'])
print("Second method's results:\n", durationlist2.head())
Second method's results:
number method_2_duration
0 condition_1 16.1
1 condition_10 15.0
2 condition_11 16.0
3 condition_12 15.4
4 condition_13 18.0
Code
# Combining the 2 computed durations
dur_comp = pd.merge(durationlist1, durationlist2, on='number')
dur_comp['diff'] = dur_comp['method_2_duration'] - dur_comp['method_1_duration']
print("Sum of differences equals to: ", dur_comp['diff'].sum(), '\n', dur_comp.head())
Sum of differences equals to: 0.0
number method_1_duration method_2_duration diff
0 condition_1 16.1 16.1 0.0
1 condition_10 15.0 15.0 0.0
2 condition_11 16.0 16.0 0.0
3 condition_12 15.4 15.4 0.0
4 condition_13 18.0 18.0 0.0
Code
dur_comp_score = pd.merge(scoredf, durationlist1, on='number')
dur_comp_score['diff'] = dur_comp_score['method_1_duration'] - dur_comp_score['days']
columns = ['number', 'days', 'method_1_duration', 'diff']
dur_comp_score[columns]
print("Sum of differences equals to: ", dur_comp_score['diff'].sum(), '\n', dur_comp_score.head())
Sum of differences equals to: 398.99999999999994
number days gender age method_1_duration diff
0 condition_1 11 2 35-39 16.1 5.1
1 condition_2 18 2 40-44 27.0 9.0
2 condition_3 13 1 45-49 15.0 2.0
3 condition_4 13 2 25-29 15.0 2.0
4 condition_5 13 2 50-54 14.9 1.9
Code
# Removing the days column
#| include: false
scoredf.drop('days', axis=1, inplace=True)
5.2 Data Transformation in Score dataset
5.2.1 Creating the target Variable
Considering that we ultimately want to fit a classifier to predict whether a case is depressed or not, we need to create our target variable. Based on the number column, we have the label of each case: condition or control. So we create a new column called "depressed" at the beginning (index 0) of the score dataset, whose values are determined by a set of conditions on the values in the "number" column: 0 refers to a non-depressed case and 1 to a depressed case.
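A minimal sketch of how such a target column can be derived from the file names (illustrative only; the toy scoredf below stands in for the real 55-row dataframe):

```python
import pandas as pd

# Toy stand-in for scoredf; the real dataframe has 55 rows.
scoredf = pd.DataFrame({'number': ['condition_1', 'control_1', 'condition_2']})

# 1 for condition (depressed), 0 for control (non-depressed),
# inserted as the first column (index 0), as described above.
scoredf.insert(0, 'depressed',
               scoredf['number'].str.startswith('condition').astype(int))
print(scoredf['depressed'].tolist())  # [1, 0, 1]
```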
As we can see in the information of the score dataset, the age variable contains "object" values representing age ranges. In order to use these data in our machine learning algorithms, we need to transform them into numbers. Each range spans 5 years, so we created a new column called "age_numeric" holding the midpoint of each age range as its representative, and then removed the "age" column. We also transformed the gender variable from (1 and 2) into (0 and 1).
Code
# Changing the age variable into numeric
age_interval_midpoints = {'20-24': 22, '25-29': 27, '30-34': 32, '35-39': 37,
                          '40-44': 42, '45-49': 47, '50-54': 52, '55-59': 57,
                          '60-64': 62, '65-69': 67}
scoredf.insert(3, 'age_numeric', scoredf['age'].map(age_interval_midpoints))
scoredf.drop(['age'], axis=1, inplace=True)
scoredf['gender'] = scoredf['gender'].apply(lambda x: 1 if x == 2 else 0)
scoredf.head()
   number       gender  age_numeric  depressed
0  condition_1  1       37           1
1  condition_2  1       42           1
2  condition_3  0       47           1
3  condition_4  1       27           1
4  condition_5  1       52           1
5.3 Completing the Score Dataset
Now that we have cleaned our datasets, we complete the score dataset for each case by adding new features extracted from the timeseries. We create 3 groups of features.
The first group is the weekday activity of each patient, which measures the average activity on each day of the week.
The second group is the hourly activity of each patient, which measures the average activity in each of the 24 hours of the day.
For the third group, we divided the 24 hours into four 6-hour intervals and calculated the average activity within each interval.
For each group of features and each csv file, a new one-row dataset is created. It contains a number column holding the file name and feature columns holding the activity information. We then concatenate these datasets and merge them with the score dataset.
Code
# In this chunk of code, we read all the datasets, extract the weekday
# information and store all of them in a new list.
weekday_df = {}
weekday_list = []
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[list(filled_dfs.keys())[i]]
    df = smoothed_series(df, 15)
    df = reduced_series(df, interval='D')
    df = pd.DataFrame(df.groupby('weekday')['activity'].mean()).reset_index()
    df.insert(0, 'number', name)
    df = df.pivot(index='number', columns='weekday', values='activity')
    df = df.reset_index()
    df.columns.name = None
    weekday_df[name] = df
    weekday_list.append(df)
Code
# Creating a new dataframe: each row holds the average daily activity
# of one patient. Then we merge it with scoredf.
weekday = pd.concat(weekday_list, ignore_index=True)
weekday_model = pd.merge(scoredf, weekday, on='number')
weekday_model.drop('number', axis=1, inplace=True)
weekday_model.head()
   gender  age_numeric  depressed  Friday      Monday      Saturday    Sunday      Thursday    Tuesday     Wednesday
0  1       37           1          105.494029  242.641250  156.362685  98.647130   132.780756  109.855370  177.003700
1  1       42           1          202.063437  127.565104  121.399491  79.149780   236.484537  141.071023  147.353673
2  0       47           1          332.303032  217.750541  290.184491  336.884884  208.149074  190.160179  232.125741
3  1       27           1          363.742338  356.325810  197.312894  237.813449  287.071736  201.662511  244.064919
4  1       52           1          118.540284  193.691644  222.639398  150.166505  124.373599  174.109815  177.934329
Code
# In this chunk of code, we read all the datasets, extract the hourly
# information and store all of them in a new list.
hourly_df = {}
hourly_list = []
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[list(filled_dfs.keys())[i]]
    df = smoothed_series(df, 15)
    df = reduced_series(df, interval='H')
    df = pd.DataFrame(df.groupby('time')['activity'].mean()).reset_index()
    df.insert(0, 'number', name)
    df = df.pivot(index='number', columns='time', values='activity')
    df = df.reset_index()
    df.columns.name = None
    hourly_df[name] = df
    hourly_list.append(df)
Code
# Creating a new dataframe: each row holds the average hourly activity
# of one patient. Then we merge it with scoredf.
hourly = pd.concat(hourly_list, ignore_index=True)
hourly_model = pd.merge(scoredf, hourly, on='number')
hourly_model.drop('number', axis=1, inplace=True)
hourly_model.head()
   gender  age_numeric  depressed  0           1           2          3          4          5          6           ...  14          15          16          17          18          19          20          21          22          23
0  1       37           1          42.086042   20.226597   9.259028   4.861389   9.888681   6.521806   8.576319    ...  306.673333  216.300131  230.481250  315.454792  310.892708  218.033472  133.973889  90.446597   69.544861   67.261319
1  1       42           1          22.031317   25.147613   21.849095  22.585185  32.760123  63.863498  243.340700  ...  280.893909  221.887110  136.397490  95.405226   85.823210   49.972305   37.916584   35.281111   27.866091   17.148313
2  0       47           1          221.818963  131.104741  91.317704  51.941185  24.730222  24.244296  27.849852   ...  459.890889  650.825998  540.470593  489.053259  423.790444  345.197333  341.871630  370.340889  251.144370  245.465185
3  1       27           1          177.992370  104.374815  88.671333  72.790815  62.562963  56.131407  76.453926   ...  491.270963  505.056741  345.716593  396.878074  386.071407  364.061852  434.668741  456.369481  322.053481  240.010519
4  1       52           1          112.257111  63.636370   52.562370  23.144741  24.400889  12.841778  14.168667   ...  442.922741  378.891926  229.858000  200.159333  142.097037  213.357556  231.114815  208.187481  157.570074  122.747778

5 rows × 27 columns
Code
# In this chunk of code, we read all the datasets, extract the 6-hour
# interval information and store all of them in a new list.
interval_df = {}
interval_list = []
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[name]
    df = smoothed_series(df, 15)
    df = reduced_series(df, interval='6H')
    df = pd.DataFrame(df.groupby('time')['activity'].mean()).reset_index()
    df.insert(0, 'number', name)
    df = df.pivot(index='number', columns='time', values='activity')
    df = df.reset_index()
    df.columns.name = None
    interval_df[name] = df
    interval_list.append(df)
Code
# Creating a new dataframe: each row holds the average interval activity of one patient.
interval = pd.concat(interval_list, ignore_index=True)
interval = interval.rename(columns={0: '0-6', 6: '6-12', 12: '12-18', 18: '18-24'})
Code
# Creating a new dataframe: each row holds the average interval activity of
# one patient. Then we merge it with scoredf.
interval_model = pd.merge(scoredf, interval, on='number')
interval_model.drop('number', axis=1, inplace=True)
interval_model.head()
   gender  age_numeric  depressed        0-6        6-12       12-18       18-24
0       1           37          1  15.473924  123.541921  288.133050  148.358808
1       1           42          1  31.372805  326.989067  210.366153   42.334602
2       0           47          1  90.859519  126.614210  486.080241  329.634975
3       1           27          1  93.753951  198.832596  437.736806  367.205914
4       1           52          1  48.140543   86.690588  348.279358  179.179123
5.4 Fourier Transformation
The Fast Fourier Transform (FFT) is an algorithm that converts a time-domain signal into the frequency domain. The resulting spectrum shows the frequency components of the input signal; the peak in the FFT magnitude marks the frequency at which the signal has the most energy, i.e. the highest amplitude. We later plot these transformed data to check whether our data shows periodicity.
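As a hedged illustration (not part of the project's own code), the dominant frequency of a synthetic daily rhythm can be recovered with NumPy's real FFT:

```python
import numpy as np

# Synthetic week of hourly activity with one cycle per day (illustrative only).
fs = 24                        # samples per day (hourly readings)
t = np.arange(0, 7, 1 / fs)    # one week of hourly timestamps, in days
signal = 100 + 80 * np.sin(2 * np.pi * 1 * t)

# Subtract the mean so the DC component does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)  # in cycles per day
dominant = freqs[np.argmax(spectrum)]
print(dominant)  # 1.0 cycle per day, matching the circadian period
```

A clear peak like this at one cycle per day is exactly the kind of periodicity we look for in the actigraph recordings.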
Now that the dataset is clean and the variables are numeric, we can visualise the data for a better understanding. The countplot of the depression condition by gender shows that the control group contained more women than men, while the distribution was reversed in the condition group.
Code
# Gender and Depression Plot
valuetable = pd.crosstab(scoredf['depressed'], scoredf['gender'])
valuetable.plot.bar(stacked=True, figsize=(3.3, 3))
plt.title('Gender and Depression', fontsize=14, y=1.02)
plt.xlabel('Depression Condition')
plt.ylabel('Count')
plt.show()
6.1 Pairplots and Heatmaps
A pair plot is a graphical representation of the pairwise relationships between variables in a dataset. It is a matrix of scatter plots and histograms. Pair plots are useful for identifying potential relationships and patterns between variables.
Heatmap is a graphical representation of a correlation matrix. A correlation matrix is a table that shows the correlation coefficients between different variables in a dataset. Heatmaps are useful for identifying which variables are highly correlated with each other. Both pair plots and heatmaps are used in this part of EDA to visualize the relationships between variables and to identify patterns and trends in the data.
6.1.1 Hourly Activity
The pairplot of the hourly activity shows that the correlation between body activity during sleeping time (midnight until 7 am) and other times of the day is not strong. Comparing the scatter plots of each hour, we can see a strong correlation between a given hour and the hour before and after it. For 6 am and 7 am, the usual wake-up times, this pattern does not hold.
Code
# Pairplot of Hourly Activity
p = sns.pairplot(hourly_model.iloc[:, 3:], height=1.2, aspect=1,
                 diag_kws={"bins": 15, "alpha": 0.7, "edgecolor": "k"})
p.fig.suptitle("Pair Plot of Hourly Activities", fontsize=30, y=1.02)
plt.show()
Code
# Heatmap of Hourly Activity
plt.figure(figsize=(10, 8))
sns.heatmap(hourly_model.iloc[:, 3:].corr(), annot=True, annot_kws={"size": 5},
            linewidth=1, linecolor="black")
plt.title('Heatmap of Hourly Activities', fontsize=14, y=1.02)
plt.show()
6.1.2 Interval Activity
The pairplot of the interval activity shows the same pattern as the hourly activities: the linear correlation between the activity from midnight until 6 am (sleeping time) and the activities of the other time intervals is not strong (less than 0.45). Regarding the heatmap of weekday activity, the correlation coefficient between the weekdays is at least 0.7.
Code
# Pairplot of Interval Activity
p = sns.pairplot(interval_model.iloc[:, 3:], height=1.2, aspect=1,
                 diag_kws={"bins": 15, "alpha": 0.7, "edgecolor": "k"})
p.fig.suptitle("Pair Plot of Interval Activities", fontsize=15, y=1.02)
plt.show()
Code
# Heatmap of Interval Activity
plt.figure(figsize=(4, 3))
sns.heatmap(interval_model.iloc[:, 3:].corr(), annot=True, annot_kws={"size": 5},
            linewidth=1, linecolor="black")
plt.title('Heatmap of Interval Activities', fontsize=14, y=1.02)
plt.show()
6.1.3 Weekday Activity
The pairplot of the weekday activity shows a linear correlation between the activity of each pair of days. Regarding the heatmap, the weekday activities are highly correlated, with correlation coefficients of at least 0.7.
Code
# Pairplot of Weekday Activity
p = sns.pairplot(weekday_model.iloc[:, 3:], height=1.2, aspect=1,
                 diag_kws={"bins": 15, "alpha": 0.7, "edgecolor": "k"})
p.fig.suptitle("Pair Plot of Weekday Activities", fontsize=15, y=1.02)
plt.show()
Code
# Heatmap of Weekday Activity
plt.figure(figsize=(6, 5))
sns.heatmap(weekday_model.iloc[:, 3:].corr(), annot=True, annot_kws={"size": 5},
            linewidth=1, linecolor="black")
plt.title('Heatmap of Weekday Activities', fontsize=14, y=1.02)
plt.show()
6.2 Plots for each timeseries
In the following, we create six plots for each dataset. We use compare_plot(), which creates a figure with six subplots comparing different visualizations of a given time series dataset.
The first subplot is a line plot of the original raw data. It shows the periodic pattern but also some noise in each dataset.
The second subplot is a line plot of the data smoothed with a moving average with a window size of 60 minutes. The smoothed data is less noisy and easier to interpret, revealing clearer patterns over time.
The third subplot is a line plot of the Fast Fourier transformed data, shown in the frequency domain instead of the time domain.
The fourth subplot shows a histogram of the raw data using a specified number of bins. All of the histograms appear right-skewed, with a few outliers towards the higher end.
The fifth subplot shows a line plot of the average activity levels per weekday. There is no clear pattern when comparing the weekday activities of the individuals with one another.
The sixth subplot shows the average activity levels per hour, aggregated over hourly intervals. There is a clear pattern of higher activity during the daytime (roughly 8 am to 8 pm) and lower activity during the nighttime (8 pm to 8 am). This plot shows almost the same trend for all individuals, which makes sense, as most people are more active during the day than at night.
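The helpers smoothed_series() and reduced_series() used throughout this section are not defined in this excerpt; the sketch below is one plausible minimal implementation, assuming each dataframe has 'timestamp' and 'activity' columns. The project's actual helpers may differ.

```python
import pandas as pd

# Hedged sketch of the two preprocessing helpers used above (assumptions:
# a 'timestamp' datetime column and an 'activity' numeric column).
def smoothed_series(df, window):
    """Moving average over `window` rows of the 'activity' column."""
    out = df.copy()
    out['activity'] = out['activity'].rolling(window, min_periods=1).mean()
    return out

def reduced_series(df, interval='H'):
    """Resample to `interval` and keep the clock hour as a 'time' label."""
    out = (df.set_index('timestamp')['activity']
             .resample(interval).mean()
             .reset_index())
    out['time'] = out['timestamp'].dt.hour  # hour-of-day label used in groupbys
    return out
```

With interval='6H' the hour label falls on 0, 6, 12, and 18, which matches the 0-6/6-12/12-18/18-24 column renaming used for the interval dataset above.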
Code
# We use the function we created before to plot each of our csv files in the
# condition and control datasets.
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[name]
    compare_plot(df, window=60, interval='H', title=name)
    plt.suptitle(i)
    plt.show()
6.3 Comparison between Condition and Control group
To see the differences between the condition and control groups, we compare the activity variable on an hourly, interval, and weekday basis. To do so, we create a new aggregated dataset that contains the hourly average of activity, with a weekday and time label for each value.
6.3.1 Heatmaps
To better understand the data, we use heatmaps to visualize the differences between the two groups.
The Hourly Activity Heatmap compares the average hourly activity of the two groups. The plot shows that the average activity starts to increase at 6 am for the control group and at 8 am for the condition group. Generally, the activity level in the control group is higher than in the condition group.
Code
# In this chunk of code, we read all the datasets in the condition and
# control groups and reduce them to hourly resolution.
dfnames = {}
all = []
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[name]
    df = smoothed_series(df, 15)
    df = reduced_series(df, interval='H')
    df.insert(0, 'number', name)
    dfnames[name] = df
    all.append(df)
Code
# Concatenating all the datasets and adding two columns to the aggregated dataset
agg_csv = pd.concat(all, ignore_index=True)
agg_csv['weekday'] = agg_csv['timestamp'].dt.day_name()
agg_csv['depressed'] = pd.to_numeric(
    np.where(agg_csv['number'].str.contains('control'), '0',
             np.where(agg_csv['number'].str.contains('condition'), '1',
                      agg_csv['number'])))
The Weekday Activity Heatmap compares the average weekday activity of the two groups. The plot shows that, except for Mondays and Fridays, the average activity for each weekday in the condition group is lower than in the control group.
The Weekday and Daytime Activity Heatmap compares the activity for each weekday and time of day in the control and condition groups. Again, the amount of activity in the control group is usually higher than in the condition group on all weekdays. The heatmap for the condition group shows lower levels of activity during weekdays and higher levels during the weekend (Saturday and Sunday), with a noticeable dip during weekday afternoons (12:00 to 16:00). In the condition group, people generally show very little activity on Tuesdays and Sundays. In the control group, we observe an increase in activity during the weekends, especially on Saturday between 11:00 and 17:00. Activity also looks higher on Monday afternoons, and for the control group Friday shows the highest activity, between 13:00 and 18:00.
Code
# Plotting the heatmap of hourly activity for each weekday
fig, axs = plt.subplots(ncols=2, figsize=(20, 5))
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
                 'Saturday', 'Sunday']

heatmap_condition = (agg_csv.loc[agg_csv['depressed'] == 1]
                     .groupby(['weekday', 'time'])['activity'].mean())
heatmap_condition = heatmap_condition.reset_index()
heatmap_condition['weekday'] = pd.Categorical(heatmap_condition['weekday'],
                                              categories=weekday_order, ordered=True)
sns.heatmap(heatmap_condition.pivot_table(index='time', columns='weekday',
                                          values='activity'),
            linewidth=0.01, linecolor="white", cmap='PiYG', ax=axs[0],
            vmin=0, vmax=350)
axs[0].set_title('Condition')

heatmap_control = (agg_csv.loc[agg_csv['depressed'] == 0]
                   .groupby(['weekday', 'time'])['activity'].mean())
heatmap_control = heatmap_control.reset_index()
heatmap_control['weekday'] = pd.Categorical(heatmap_control['weekday'],
                                            categories=weekday_order, ordered=True)
sns.heatmap(heatmap_control.pivot_table(index='time', columns='weekday',
                                        values='activity'),
            linewidth=0.01, linecolor="white", cmap='PiYG', ax=axs[1],
            vmin=0, vmax=350)
axs[1].set_title('Control')

plt.suptitle('Weekday and Daytime Activity Heatmap', fontsize=18)
plt.show()
6.3.2 Lineplots
In the following, we compare the weekday activity and hourly activity using lineplots.
In the Weekday Activity Lineplot, it can be seen that on all weekdays the amount of activity for the control group is higher than for the condition group. The weekday activity level of the condition group has its minimum on Sunday, while the control group has its minimum on Wednesday. Both groups have their highest activity level on Friday.
Code
# Comparing the mean weekday activity of the two groups
plt.figure(figsize=(7, 3))
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
                 'Saturday', 'Sunday']

lineplot_condition = (agg_csv.loc[agg_csv['depressed'] == 1]
                      .groupby(['weekday'])['activity'].mean().reset_index())
lineplot_condition['weekday'] = pd.Categorical(lineplot_condition['weekday'],
                                               categories=weekday_order, ordered=True)
lineplot_condition = lineplot_condition.sort_values('weekday')

lineplot_control = (agg_csv.loc[agg_csv['depressed'] == 0]
                    .groupby(['weekday'])['activity'].mean().reset_index())
lineplot_control['weekday'] = pd.Categorical(lineplot_control['weekday'],
                                             categories=weekday_order, ordered=True)
lineplot_control = lineplot_control.sort_values('weekday')

plt.plot(lineplot_condition['weekday'], lineplot_condition['activity'], 'g-o', label='Condition')
plt.plot(lineplot_control['weekday'], lineplot_control['activity'], 'y-o', label='Control')
plt.xlabel('Weekday')
plt.ylabel('Mean of activities')
plt.legend(bbox_to_anchor=(1, 1), loc='upper right')
plt.suptitle('Weekday Activity Lineplot', fontsize=14)
plt.grid(True)
plt.show()
The Hourly Activity Lineplot shows that both the control and condition groups follow the same trend, but overall the activity per hour of the control group is higher than that of the condition group.
We can see that the mean activity level is generally higher during the daytime (between 7:00 and 20:00) and lower during the nighttime (between 21:00 and 6:00) for both groups. However, the activity levels of the condition group are generally lower than those of the control group throughout the day. Overall, the line plot clearly shows the differences in mean activity levels between the two groups throughout the day, with the control group showing higher activity levels.
Code
# Comparing the mean activity over the 24 hours for the two groups
plt.figure(figsize=(10, 3))
lineplot_condition = (agg_csv.loc[agg_csv['depressed'] == 1]
                      .groupby(['time'])['activity'].mean().reset_index())
plt.plot(lineplot_condition['time'], lineplot_condition['activity'], 'k-o', label='Condition')
lineplot_control = (agg_csv.loc[agg_csv['depressed'] == 0]
                    .groupby(['time'])['activity'].mean().reset_index())
plt.plot(lineplot_control['time'], lineplot_control['activity'], 'r-o', label='Control')
plt.xlabel('Time')
plt.ylabel('Mean of activities')
plt.legend(bbox_to_anchor=(1, 1), loc='upper right')
plt.suptitle('Hourly Activity Lineplot', fontsize=14)
plt.grid(True)
plt.show()
6.3.3 Boxplots
In the following, we use boxplots for the weekday, hourly, and interval activity of each group. They help us understand the distribution of each feature and detect outliers. Based on the generated boxplots, both groups have some outliers in the upper range of values, which suggests that a subset of individuals in both groups occasionally records very high activity. Overall, the boxplots suggest that while there may be some differences in the distributions between the groups, both follow similar patterns.
Code
# Weekday boxplots for each group
fig, axs = plt.subplots(ncols=2, figsize=(20, 5))
sns.boxplot(data=weekday_model.loc[weekday_model['depressed'] == 1].iloc[:, 3:],
            palette='Set3', ax=axs[0])
axs[0].set_title('Condition')
axs[0].set_ylim(0, 500)
sns.boxplot(data=weekday_model.loc[weekday_model['depressed'] == 0].iloc[:, 3:],
            palette='Set3', ax=axs[1])
axs[1].set_title('Control')
axs[1].set_ylim(0, 500)
plt.suptitle('Weekday Boxplot', fontsize=18)
plt.show()
# Hourly boxplots for each group
fig, axs = plt.subplots(ncols=2, figsize=(20, 5))
sns.boxplot(data=hourly_model.loc[hourly_model['depressed'] == 1].iloc[:, 3:],
            palette='Set3', ax=axs[0])
axs[0].set_title('Condition')
axs[0].set_ylim(0, 900)
sns.boxplot(data=hourly_model.loc[hourly_model['depressed'] == 0].iloc[:, 3:],
            palette='Set3', ax=axs[1])
axs[1].set_title('Control')
axs[1].set_ylim(0, 900)
plt.suptitle('Hourly Boxplot', fontsize=18)
plt.show()
7 Feature Selection using tsfresh library
In this part of the study, we use the tsfresh package to create a new dataset. This library can extract a range of features from time series data, including statistical features (such as the mean, variance, and autocorrelation), Fourier-transform-based features (such as frequency-domain power spectra and peak frequencies), and features based on time-series models (such as ARIMA and state-space models). In the following, we extract a selection of statistical features (mean, standard deviation, variance, skewness, kurtosis, maximum, minimum, median, absolute sum of changes, longest strike above mean, longest strike below mean, count above mean, count below mean, and autocorrelation) from the sensor data of each person using tsfresh. For each dataset, 92 features are created. We then clean the new dataset by removing features with a standard deviation below 0.05; this near-zero-variance filter removes 35 columns. Finally, we scale the data between 0 and 1.
tsfresh also creates features over the index of each dataset. For our study, these features are not informative, so we remove them from the dataset.
# Creating a function for extracting statistical features using tsfresh
from tsfresh.feature_extraction import extract_features, EfficientFCParameters

def feature_extraction(df):
    features = ['mean', 'standard_deviation', 'variance', 'skewness',
                'kurtosis', 'maximum', 'minimum', 'median',
                'absolute_sum_of_changes', 'longest_strike_above_mean',
                'longest_strike_below_mean', 'count_above_mean',
                'count_below_mean', 'autocorrelation']
    fc_parameters = EfficientFCParameters()
    fc_parameters_filtered = {feat: fc_parameters[feat]
                              for feat in features if feat in fc_parameters}
    extracted_features = extract_features(df, column_id="number",
                                          column_sort="timestamp",
                                          default_fc_parameters=fc_parameters_filtered)
    extracted_features.reset_index(inplace=True)
    extracted_features.rename(columns={'index': 'Datafile'}, inplace=True)
    return extracted_features
Code
# tsfresh features
# First, we select some important tsfresh features; otherwise we would have
# more than 750 features, which would be hard to deal with.
tsfresh_df = {}
tsfresh_list = []
for i, name in enumerate(list(filled_dfs.keys())):
    df = filled_dfs[name]
    df = df.reset_index()
    df = df.reset_index()
    df.columns.name = None
    df = feature_extraction(df)
    tsfresh_df[name] = df
    tsfresh_list.append(df)
# checking for missing values
print("Sum of the missing values in TSfresh features: ", sum(tsfeatures.isna().sum()))
Sum of the missing values in TSfresh features: 0
Code
tsfresh_score = pd.merge(scoredf, tsfeatures, on ='number').drop('number',axis=1)
Code
# removing near-zero-variance columns
cols_to_drop = []
for col in tsfresh_score.columns:
    if tsfresh_score[col].std() < 0.05:
        cols_to_drop.append(col)
tsfresh_score = tsfresh_score.drop(cols_to_drop, axis=1)
tsfresh_score.head()
   gender  age_numeric  depressed  index__mean  index__standard_deviation  index__variance  index__maximum  index__median  index__absolute_sum_of_changes  index__longest_strike_above_mean  ...  fft__count_below_mean  fft__autocorrelation__lag_1  fft__autocorrelation__lag_2  fft__autocorrelation__lag_3  fft__autocorrelation__lag_4  fft__autocorrelation__lag_5  fft__autocorrelation__lag_6  fft__autocorrelation__lag_7  fft__autocorrelation__lag_8  fft__autocorrelation__lag_9
0  1  37  1  11621.5  6709.964822  4.502363e+07  23243.0  11621.5  23243.0  11622.0  ...  15903.0  0.428845  0.383771  0.384540  0.437382  0.381491  0.392197  0.446910  0.384587  0.436016
1  1  42  1  19462.5  11236.968286  1.262695e+08  38925.0  19462.5  38925.0  19463.0  ...  26739.0  0.581278  0.468652  0.434727  0.446788  0.362825  0.355307  0.323393  0.327928  0.310079
2  0  47  1  10823.5  6249.239307  3.905299e+07  21647.0  10823.5  21647.0  10824.0  ...  15281.0  0.323589  0.355088  0.309398  0.274155  0.293437  0.326220  0.275815  0.286995  0.328600
3  1  27  1  10777.5  6222.681195  3.872176e+07  21555.0  10777.5  21555.0  10778.0  ...  14375.0  0.512339  0.385585  0.377913  0.389713  0.404378  0.358831  0.308601  0.317345  0.370275
4  1  52  1  10746.0  6204.494661  3.849575e+07  21492.0  10746.0  21492.0  10746.0  ...  14694.0  0.368063  0.352426  0.335389  0.337658  0.306190  0.347292  0.341690  0.325995  0.346490
5 rows × 60 columns
Code
tsfresh_score.describe().transpose()
count  mean  std  min  25%  50%  75%  max
gender  55.0  4.545455e-01  5.025189e-01  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  1.000000e+00
age_numeric  55.0  4.036364e+01  1.224951e+01  2.200000e+01  3.200000e+01  4.200000e+01  4.950000e+01  6.700000e+01
depressed  55.0  4.181818e-01  4.978066e-01  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  1.000000e+00
index__mean  55.0  1.429046e+04  5.361588e+03  9.649000e+03  1.083050e+04  1.162150e+04  1.573900e+04  3.270300e+04
index__standard_deviation  55.0  8.250892e+03  3.095514e+03  5.571141e+03  6.253281e+03  6.709965e+03  9.087205e+03  1.888137e+04
index__variance  55.0  7.748520e+07  6.781737e+07  3.103762e+07  3.910352e+07  4.502363e+07  8.257729e+07  3.565063e+08
index__maximum  55.0  2.858093e+04  1.072318e+04  1.929800e+04  2.166100e+04  2.324300e+04  3.147800e+04  6.540600e+04
index__median  55.0  1.429046e+04  5.361588e+03  9.649000e+03  1.083050e+04  1.162150e+04  1.573900e+04  3.270300e+04
index__absolute_sum_of_changes  55.0  2.858093e+04  1.072318e+04  1.929800e+04  2.166100e+04  2.324300e+04  3.147800e+04  6.540600e+04
index__longest_strike_above_mean  55.0  1.429065e+04  5.361537e+03  9.649000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
index__longest_strike_below_mean  55.0  1.429065e+04  5.361537e+03  9.649000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
index__count_above_mean  55.0  1.429065e+04  5.361537e+03  9.649000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
index__count_below_mean  55.0  1.429065e+04  5.361537e+03  9.649000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
freq__longest_strike_above_mean  55.0  1.429087e+04  5.361626e+03  9.650000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
freq__longest_strike_below_mean  55.0  1.429065e+04  5.361537e+03  9.649000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
freq__count_above_mean  55.0  1.429087e+04  5.361626e+03  9.650000e+03  1.083050e+04  1.162200e+04  1.573900e+04  3.270300e+04
freq__count_below_mean  55.0  1.429091e+04  5.361638e+03  9.649000e+03  1.083050e+04  1.162200e+04  1.574000e+04  3.270400e+04
activity__mean  55.0  1.899132e+02  8.050574e+01  5.469799e+01  1.423017e+02  1.722886e+02  2.433783e+02  3.988835e+02
activity__standard_deviation  55.0  3.406043e+02  9.980013e+01  1.361343e+02  2.851245e+02  3.307975e+02  4.119955e+02  6.243622e+02
activity__variance  55.0  1.257903e+05  7.204672e+04  1.853256e+04  8.129622e+04  1.094270e+05  1.697508e+05  3.898282e+05
activity__skewness  55.0  3.176805e+00  9.595812e-01  1.641789e+00  2.426802e+00  2.818412e+00  3.843546e+00  5.616192e+00
activity__kurtosis  55.0  1.726993e+01  1.277763e+01  2.892463e+00  8.630075e+00  1.164787e+01  2.486066e+01  5.884009e+01
activity__maximum  55.0  4.834345e+03  1.700861e+03  1.355000e+03  3.574000e+03  4.609000e+03  6.024000e+03  8.000000e+03
activity__minimum  55.0  5.454545e-02  4.045199e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  3.000000e+00
activity__median  55.0  2.889091e+01  3.808843e+01  0.000000e+00  0.000000e+00  9.000000e+00  4.750000e+01  1.410000e+02
activity__absolute_sum_of_changes  55.0  2.791822e+06  9.314838e+05  1.035370e+06  2.122212e+06  2.872270e+06  3.238874e+06  5.966645e+06
activity__longest_strike_above_mean  55.0  1.306727e+02  1.161648e+02  3.000000e+01  7.850000e+01  1.020000e+02  1.405000e+02  8.210000e+02
activity__longest_strike_below_mean  55.0  4.822055e+03  6.737933e+03  1.870000e+02  7.525000e+02  1.367000e+03  5.504000e+03  3.356800e+04
activity__count_above_mean  55.0  7.466327e+03  1.694047e+03  3.842000e+03  6.330500e+03  7.007000e+03  8.759000e+03  1.124700e+04
activity__count_below_mean  55.0  2.111560e+04  9.550260e+03  1.350200e+04  1.514900e+04  1.677400e+04  2.258950e+04  5.484200e+04
activity__autocorrelation__lag_2  55.0  6.673653e-01  6.568015e-02  4.434379e-01  6.286960e-01  6.727768e-01  7.115975e-01  8.017300e-01
activity__autocorrelation__lag_3  55.0  6.077184e-01  7.588741e-02  3.705619e-01  5.626888e-01  6.188648e-01  6.602492e-01  7.639179e-01
activity__autocorrelation__lag_4  55.0  5.643909e-01  8.262429e-02  3.174616e-01  5.181442e-01  5.666592e-01  6.286069e-01  7.373554e-01
activity__autocorrelation__lag_5  55.0  5.319031e-01  8.848471e-02  2.832638e-01  4.856612e-01  5.360520e-01  6.032284e-01  7.201411e-01
activity__autocorrelation__lag_6  55.0  5.041493e-01  9.143303e-02  2.559138e-01  4.568656e-01  4.983664e-01  5.784877e-01  6.986781e-01
activity__autocorrelation__lag_7  55.0  4.811607e-01  9.424126e-02  2.279063e-01  4.328695e-01  4.850774e-01  5.575946e-01  6.812188e-01
activity__autocorrelation__lag_8  55.0  4.613381e-01  9.697063e-02  2.071357e-01  4.095218e-01  4.640680e-01  5.375010e-01  6.654096e-01
activity__autocorrelation__lag_9  55.0  4.455279e-01  9.863420e-02  1.894618e-01  3.892091e-01  4.508276e-01  5.165660e-01  6.522154e-01
fft__mean  55.0  3.540192e+04  9.831960e+03  1.543033e+04  2.829112e+04  3.577489e+04  3.957435e+04  5.835364e+04
fft__standard_deviation  55.0  5.407938e+04  1.887448e+04  1.629249e+04  4.110384e+04  5.454124e+04  6.424589e+04  1.035372e+05
fft__variance  55.0  3.274348e+09  2.179592e+09  2.654452e+08  1.689639e+09  2.974747e+09  4.127736e+09  1.071996e+10
fft__skewness  55.0  3.923597e+01  9.730474e+00  1.778328e+01  3.376921e+01  3.902993e+01  4.553731e+01  6.410507e+01
fft__kurtosis  55.0  3.210458e+03  1.218441e+03  9.325536e+02  2.496411e+03  3.226171e+03  3.933307e+03  7.045172e+03
fft__maximum  55.0  5.134640e+06  2.121832e+06  1.413779e+06  3.564174e+06  5.132693e+06  6.276905e+06  1.255406e+07
fft__minimum  55.0  2.280393e+02  1.248990e+02  2.083425e+01  1.463562e+02  2.152947e+02  3.034413e+02  5.538023e+02
fft__median  55.0  2.582022e+04  6.948183e+03  1.199335e+04  2.054272e+04  2.711399e+04  2.900791e+04  4.218247e+04
fft__absolute_sum_of_changes  55.0  5.488147e+08  2.309935e+08  1.888553e+08  3.780757e+08  4.954812e+08  6.822667e+08  1.232146e+09
fft__longest_strike_above_mean  55.0  8.767273e+01  3.252808e+01  4.700000e+01  6.400000e+01  8.200000e+01  9.950000e+01  1.840000e+02
fft__longest_strike_below_mean  55.0  9.880000e+01  3.598271e+01  4.200000e+01  7.100000e+01  9.900000e+01  1.230000e+02  1.890000e+02
fft__count_above_mean  55.0  9.049618e+03  3.199748e+03  6.141000e+03  6.929000e+03  7.781000e+03  1.006500e+04  2.181100e+04
fft__count_below_mean  55.0  1.953231e+04  7.575604e+03  1.315800e+04  1.468800e+04  1.598700e+04  2.141400e+04  4.359600e+04
fft__autocorrelation__lag_1  55.0  4.678749e-01  1.287744e-01  2.256121e-01  3.669427e-01  4.409205e-01  5.529907e-01  7.651120e-01
fft__autocorrelation__lag_2  55.0  3.974753e-01  6.821929e-02  2.274958e-01  3.476014e-01  3.876729e-01  4.581997e-01  5.406930e-01
fft__autocorrelation__lag_3  55.0  3.675708e-01  6.360275e-02  2.073741e-01  3.181789e-01  3.729450e-01  4.138719e-01  5.156808e-01
fft__autocorrelation__lag_4  55.0  3.554635e-01  6.425523e-02  1.990177e-01  3.054515e-01  3.512956e-01  4.087008e-01  4.938941e-01
fft__autocorrelation__lag_5  55.0  3.449956e-01  6.232872e-02  2.154097e-01  2.959818e-01  3.381020e-01  3.912056e-01  4.680184e-01
fft__autocorrelation__lag_6  55.0  3.432152e-01  6.607012e-02  2.114513e-01  2.944385e-01  3.432333e-01  3.851774e-01  5.114441e-01
fft__autocorrelation__lag_7  55.0  3.344160e-01  6.494353e-02  1.984398e-01  2.856310e-01  3.250323e-01  3.765627e-01  4.816575e-01
fft__autocorrelation__lag_8  55.0  3.333697e-01  5.989195e-02  2.013843e-01  2.890481e-01  3.283283e-01  3.776024e-01  4.498979e-01
fft__autocorrelation__lag_9  55.0  3.394102e-01  6.153172e-02  2.165290e-01  2.900873e-01  3.345822e-01  3.913068e-01  4.571922e-01
Code
# removing all the columns which contain "index" in their name
tsfresh_score = tsfresh_score.filter(regex='^(?!.*index)')
Code
# Scaling the data between 0 and 1
from sklearn.preprocessing import MinMaxScaler

features = tsfresh_score.columns.values
scaler = MinMaxScaler(feature_range=(0, 1))
scaledts = pd.DataFrame(scaler.fit_transform(tsfresh_score))
scaledts.columns = features
scaledts.head()
   gender  age_numeric  depressed  freq__longest_strike_above_mean  freq__longest_strike_below_mean  freq__count_above_mean  freq__count_below_mean  activity__mean  activity__standard_deviation  activity__variance  ...  fft__count_below_mean  fft__autocorrelation__lag_1  fft__autocorrelation__lag_2  fft__autocorrelation__lag_3  fft__autocorrelation__lag_4  fft__autocorrelation__lag_5  fft__autocorrelation__lag_6  fft__autocorrelation__lag_7  fft__autocorrelation__lag_8  fft__autocorrelation__lag_9
0  1.0  0.333333  1.0  0.085542  0.085582  0.085542  0.085578  0.268024  0.324545  0.183811  ...  0.090183  0.376706  0.498966  0.574641  0.808352  0.657466  0.602499  0.877313  0.737193  0.912010
1  1.0  0.444444  1.0  0.425671  0.425696  0.425671  0.425678  0.287539  0.398714  0.244803  ...  0.446186  0.659252  0.769982  0.737423  0.840253  0.583571  0.479532  0.441192  0.509201  0.388716
2  0.0  0.555556  1.0  0.050926  0.050967  0.050926  0.050965  0.611884  0.540193  0.380734  ...  0.069748  0.181606  0.407385  0.330918  0.254808  0.308886  0.382570  0.273201  0.344492  0.465674
3  1.0  0.111111  1.0  0.048931  0.048972  0.048931  0.048970  0.639681  0.634753  0.485914  ...  0.039983  0.531468  0.504760  0.553147  0.646694  0.748068  0.491276  0.388964  0.466616  0.638841
4  1.0  0.666667  1.0  0.047543  0.047584  0.047543  0.047582  0.326998  0.332997  0.190405  ...  0.050463  0.264042  0.398887  0.415220  0.470164  0.359370  0.452814  0.505796  0.501423  0.540011
5 rows × 50 columns
We divide our tsfresh-extracted feature data into three parts: features computed on the raw activity, Fast Fourier transform coefficients, and features related to the frequency of the data. We then use them as inputs to our machine learning algorithms.
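Since the tsfresh column names carry their group as a prefix (activity__, freq__, fft__, as in the table above), the three-way split can be sketched with a simple prefix filter; the variable names here are assumptions for illustration:

```python
import pandas as pd

# Sketch: split the scaled tsfresh table into feature groups by column-name
# prefix, keeping the metadata/target columns alongside each group.
meta = ['gender', 'age_numeric', 'depressed']

def feature_group(df, prefix):
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[meta + cols]

# activity_feats = feature_group(scaledts, 'activity__')
# freq_feats     = feature_group(scaledts, 'freq__')
# fft_feats      = feature_group(scaledts, 'fft__')
```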
8 Classification Models
In this part of the project, we train classification machine learning models. We use the created features as inputs to our models and compare the classification metrics to see which features and which models give better results. To split the datasets into train and test sets, we generate random row indices. As the number of observations is not large, we set the proportion of the test dataset to 50%. We then split each dataset using the same indices, so the same rows serve as test and train samples across datasets and consistency is kept throughout the process.
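The shared-index split described above can be sketched as follows (a minimal sketch; the seed and variable names are assumptions, not the project's exact code):

```python
import numpy as np

# Draw one set of random row indices and reuse it for every feature dataset,
# so each model is trained and evaluated on the same patients (~50% test).
rng = np.random.default_rng(42)
n = 55                                          # number of observations
test_idx = rng.choice(n, size=n // 2, replace=False)
train_idx = np.setdiff1d(np.arange(n), test_idx)

# The same indices then split every feature table, e.g.
# train = hourly_model.iloc[train_idx]; test = hourly_model.iloc[test_idx]
```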
We previously created six different datasets: the hourly, interval, and weekday activity features, plus the tsfresh-extracted activity, frequency, and FFT features. We fit our classification models separately on each of these datasets and compare the results, with the depressed label as the target variable. We trained several classifiers: Logistic Regression, Linear Discriminant Analysis (LDA), Decision Tree, K-Nearest Neighbors, Random Forest, Gaussian Naive Bayes (GaussianNB), Support Vector Machine (SVM), and Multilayer Perceptron (MLP). For each classifier, several classification metrics were computed. A brief explanation of each metric follows:
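The training loop over these classifiers can be sketched with scikit-learn as follows; the hyperparameters are library defaults plus illustrative seeds, not necessarily the project's exact settings, and the data here is a synthetic stand-in:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA":                 LinearDiscriminantAnalysis(),
    "Decision Tree":       DecisionTreeClassifier(random_state=0),
    "KNN":                 KNeighborsClassifier(),
    "Random Forest":       RandomForestClassifier(random_state=0),
    "GaussianNB":          GaussianNB(),
    "SVM":                 SVC(probability=True, random_state=0),
    "MLP":                 MLPClassifier(max_iter=2000, random_state=0),
}

# Synthetic stand-in for one of the six feature datasets
# (28 train rows / 27 test rows, matching a 50% split of 55 subjects).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(28, 6))
y_train = np.array([0, 1] * 14)          # balanced toy labels
X_test = rng.normal(size=(27, 6))

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)            # train on this feature set
    predictions[name] = clf.predict(X_test)
```

Running the same loop once per feature dataset yields one metric table per dataset, which is what the comparison below is based on.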
Accuracy The percentage of correctly classified samples, positive and negative. It can be misleading for imbalanced datasets and should be interpreted together with other metrics.
Precision Depicts the fraction of true positives among those classified as positives. It is also called positive predictive value.
Recall The fraction of actual positive samples that are correctly classified. It tells us how well the model identifies true positives.
F1-score Harmonic mean of precision and recall.
ROC AUC The area under the receiver operating characteristic (ROC) curve, which plots a binary classifier's true positive rate against its false positive rate as the discrimination threshold varies. The ideal point is the top-left corner, where true positives are high and false positives are low.
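All five metrics are available in scikit-learn; the labels and probabilities below are a made-up toy example, not project data:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth, hard predictions, and predicted probabilities.
y_true  = [0, 0, 1, 1, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.7, 0.8, 0.9, 0.4, 0.2]   # P(depressed) per subject

print("Accuracy :", accuracy_score(y_true, y_pred))    # 4/6 correct
print("Precision:", precision_score(y_true, y_pred))   # 2 TP / 3 predicted positive
print("Recall   :", recall_score(y_true, y_pred))      # 2 TP / 3 actual positive
print("F1       :", f1_score(y_true, y_pred))          # harmonic mean of the two
print("ROC AUC  :", roc_auc_score(y_true, y_score))    # threshold-free ranking quality
```

Note that ROC AUC takes the continuous scores rather than the hard predictions, which is why `SVC` above is created with `probability=True`.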
For each classifier, the metrics described above (accuracy, precision, recall, F1 score, and ROC AUC) were computed; the results are shown in the following table. Comparing the models on overall performance (ROC AUC and accuracy), the SVM trained on the Hourly Activity features had the highest accuracy and ROC AUC, 81% and 80%, respectively. According to its confusion matrix, the SVM predicted 10 of the 11 non-depressed individuals and 12 of the 16 depressed individuals correctly, while 4 depressed individuals were classified as non-depressed and 1 non-depressed individual was classified as depressed. The second-highest accuracy and ROC AUC, 81% and 78%, were also obtained by the Support Vector Machine on the Hourly Activity features. The third-highest accuracy of 78% and ROC AUC of 70% were shared by the Logistic Regression model trained on the tsfresh Frequency features and by the Decision Tree and Random Forest models on the tsfresh FFT features.
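As a quick sanity check, the confusion matrix reported for the best model reproduces its 81% accuracy figure:

```python
import numpy as np

# Confusion matrix reported for the SVM on Hourly Activity features:
# rows = actual class, columns = predicted class.
#               pred non-dep  pred dep
cm = np.array([[10,           1],      # 11 actual non-depressed
               [ 4,          12]])     # 16 actual depressed

accuracy = np.trace(cm) / cm.sum()          # (10 + 12) / 27 subjects
recall_depressed = cm[1, 1] / cm[1].sum()   # 12 / 16 depressed found
print(f"accuracy={accuracy:.2f}, recall={recall_depressed:.2f}")
# → accuracy=0.81, recall=0.75
```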
9 Conclusion
In conclusion, the analysis and classification of depression using this dataset have provided valuable insights. The activity data extracted from the sensor recordings showed promising potential as a feature for diagnosing depression in patients, and the results obtained with machine learning algorithms demonstrated the feasibility of performing classification based on the available features.
However, it is important to note that the limited number of observations available for training the model (55) may have impacted the overall reliability and generalizability of the results. Because the number of observations is small, the ratio used to split the data into train and test sets affects the classification metrics. With a larger dataset, the model's performance could potentially be improved and the results would be more robust and accurate. Additionally, it is worth considering that depression is a complex mental health condition influenced by various factors, and activity level alone might not capture the complete picture. Incorporating additional information, such as psychological questionnaire data, could enhance the accuracy and reliability of the classification results.
Furthermore, it is essential to acknowledge that the dataset used in this analysis was collected in 2010. Given the advancements in technology and data collection techniques, more accurate and comprehensive data can now be obtained. Access to up-to-date and high-quality data would greatly contribute to better understanding and diagnosing depression. To conclude, while the activity data extracted from the sensor data holds promise as a feature for depression diagnosis, further research with larger datasets and the inclusion of diverse data sources is warranted to enhance the reliability and accuracy of classification models in diagnosing depression.
10 References
Arora, A., & Pinaki. (2023). Identifying digital biomarkers in actigraph based sequential motor activity data for assessment of depression: a model evaluating SVM in LSTM extracted feature space. International Journal of Information Technology, 15(2), 797-802. https://doi.org/10.1007/s41870-023-01162-5
Berle, J. O., Hauge, E. R., Oedegaard, K. J., Holsten, F., & Fasmer, O. B. (2010). Actigraphic registration of motor activity reveals a more structured behavioural pattern in schizophrenia than in major depression. BMC Research Notes, 3(149). https://doi.org/10.1186/1756-0500-3-149
Garcia-Ceja, E., Riegler, M., Jakobsen, P., Tørresen, J., Nordgreen, T., Oedegaard, K. J., & Fasmer, O. B. (2018). Depresjon: a motor activity database of depression episodes in unipolar and bipolar patients. In MMSys ’18: Proceedings of the 9th ACM Multimedia Systems Conference (pp. 472–477). https://doi.org/10.1145/3204949.3208125